Goals & Motivation

Based on a sample of data from online media (blogs, news, Twitter), this report aims to highlight the most frequent words and word combinations in modern English. The goal is to build a tool that can efficiently predict the next word for a person typing text in real time.

Challenges

Some of the challenges with natural language processing (NLP):

Some of the computational challenges with NLP:

Datasets

The datasets leveraged for this exercise stem from three types of media, spanning the spectrum from casual (Twitter) to formal (news) language. Since they can already be considered a sample, no additional sampling is applied at the exploration phase, in an attempt to understand the capabilities of a household device on the full data.

What are the dimensions of each table, and what is the object size in device memory?

The Twitter object is 2.6143238 × 10^8 bytes (~261 MB), the news object is 6.2574801 × 10^8 bytes (~626 MB), and the blog object is 8.2816618 × 10^8 bytes (~828 MB).

  • The Twitter table has 743358 rows and 24 columns,
  • the news table has 993085 rows and 85 columns,
  • and the blog table has 891692 rows and 140 columns.

All three sizes are significant and will need special handling to stay within the computational capabilities of a household system.

Data pre-processing

The next step is to reshape the data into a data frame with one word per row and the count of its appearances in the data as its frequency.

!! For the news and blog sources, the number of columns cannot be handled on an 8 GB RAM laptop. Hence I need to split the original tables in two for faster processing (i.e. when columns > 60).
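The same memory pressure can be sidestepped by accumulating counts incrementally rather than holding a whole table in RAM. A minimal sketch of that general pattern in Python (the report's analysis is in R; `count_words_in_chunks` and its chunk size are hypothetical, not the original splitting code):

```python
from collections import Counter
from itertools import islice

def count_words_in_chunks(lines, chunk_size=100_000):
    """Accumulate word counts chunk by chunk, so the whole
    corpus never has to sit in memory at once."""
    totals = Counter()
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        for line in chunk:
            totals.update(line.split())
    return totals
```

Only one chunk of lines plus the running `Counter` is resident at any time, which keeps peak memory roughly constant regardless of corpus size.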

In terms of clean-up, punctuation, capitalization and numerical characters are excluded to produce a more uniform dataset.
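The clean-up plus reshaping steps described above can be sketched in a few lines of Python (the original work is in R; the helper names `clean` and `word_frequencies` are hypothetical):

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, then drop punctuation and digits, mirroring the
    clean-up described above."""
    text = text.lower()
    # everything that is not a lowercase letter or whitespace becomes a space
    return re.sub(r"[^a-z\s]", " ", text)

def word_frequencies(lines):
    """One word per row with its count, like the Var1/Freq table."""
    counts = Counter()
    for line in lines:
        counts.update(clean(line).split())
    return counts.most_common()  # list of (word, frequency), most frequent first
```

Running `word_frequencies` over the corpus yields exactly the word/frequency pairs shown in the preview below.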

After the clean-up, a sneak peek into the data:

## Source: local data frame [6 x 2]
## 
##    Var1    Freq
##   (chr)   (int)
## 1   the 1748107
## 2   and 1025288
## 3    to 1004097
## 4     a  844858
## 5    of  823929
## 6     i  724972
How much have the tables been compressed by the clean-up?
  1. The Twitter table has been compressed by 56.3%
  2. The News table has been compressed by 52.1%
  3. The Blogs table has been compressed by 65.4%
How many words make up 50% of each dataset based on their frequency? 90%? 10%?

In my datasets the 90th percentile appears at frequencies of 10, 10 or 15, depending on the table.

Only 1% of my words have a frequency higher than 325.28, 445 or 626 (one threshold per data table).

The 10th and the 50th percentiles appear in all tables at frequency = 1, which means that my data has a really long tail of words that have appeared only once and are therefore highly improbable.
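These percentile statements amount to reading off quantiles of the frequency column. A toy sketch in Python of how a long-tailed frequency column produces exactly this pattern (a simple nearest-rank percentile, not the exact quantile method used in R):

```python
def frequency_percentile(freqs, q):
    """Frequency value at quantile q (0..1) of the frequency column,
    using a simple nearest-rank rule."""
    s = sorted(freqs)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]

# Hypothetical long-tailed table: 90 words seen once, 9 seen 10 times,
# 1 very frequent word -- the shape described in the text.
toy_freqs = [1] * 90 + [10] * 9 + [500]
```

With such a tail, the 10th and 50th percentiles both land on frequency 1, while the top 1% is orders of magnitude higher, matching the observation above.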

Some thoughts before modeling the data

My model could take into account the context of each person typing; for that, each source table (Twitter, news, blogs) could be leveraged to produce different “context” variables.

As a first take, though, I am looking to come up with a context-less model for simplicity, although performance might be compromised by design. Therefore, I will merge the three tables. However, given the different table sizes, I want to avoid skewing the frequencies towards the larger tables. To do so, I need to transform net frequencies into probabilities by dividing each word's count by the total number of word occurrences in its source.
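The normalization-then-merge step can be sketched in Python (the report's code is in R; equal weighting of the three sources in `merge_sources` is my assumption, the report only specifies dividing by each source's total):

```python
def to_probabilities(table):
    """Convert raw counts to within-source probabilities so that a
    larger corpus does not dominate the merged table."""
    total = sum(table.values())
    return {word: count / total for word, count in table.items()}

def merge_sources(*tables):
    """Average per-source probabilities into one context-less table
    (equal source weights -- an assumption, not the report's spec)."""
    merged = {}
    for t in tables:
        for word, prob in to_probabilities(t).items():
            merged[word] = merged.get(word, 0.0) + prob / len(tables)
    return merged
```

Because each source is normalized before merging, a word's merged probability reflects how typical it is within each medium, not how big that medium's table happens to be.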

Twitter: most frequent words

News: most frequent words

Blogs: most frequent words

Combined Media (US, en): most frequent words

Combined Media (US, en): most frequent words / comparing sources

Combined Media (US, en): frequent words’ probability for ranks 50-180

Combined Media (US, en): frequent words / comparing sources for ranks 200-500

It is easy to spot from the above visualisations which words are rather specific to a medium (e.g. the word “happy” on Twitter) and which words have rather similar probabilities across media (e.g. the word “home”).

Combined Media (US, en): Histogram

Next Steps

The next step before starting to model the data is to analyze word combinations, i.e. n-grams. This should further boost the performance of a potential model, since the probability of a word combination (of 2, 3 or more words) could be a better predictor than the probability of a single word.
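Extracting n-grams is a small extension of the word-counting pipeline. A minimal Python sketch (the report's analysis is in R; the `ngrams` helper is hypothetical):

```python
def ngrams(words, n):
    """All contiguous word combinations of length n from a token list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# e.g. bigrams of a cleaned sentence:
ngrams("the cat sat on the mat".split(), 2)
```

Counting these tuples with the same frequency machinery used for single words yields the n-gram tables needed for the next modeling phase.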

A key learning from this first phase is that the top 1% of most frequent words are “staple” words (articles, connectors) that provide no context. An idea to explore is categorizing n-grams as contextual or context-less, so as to develop a model that is aware of its potential to predict (contextual) or not predict (context-less). For example:
  1. When the previous word is context-less, e.g. “the”, the capability to predict is potentially low (it equals the probability of one specific noun relative to all nouns).
  2. When the previous word is contextual, e.g. “mayor”, the capability to predict is high, since the probability mass is concentrated in a much smaller group of candidate words (with an exponential difference!).
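The contextual/context-less distinction above can be made concrete with bigram conditional probabilities, P(next | previous). A hedged Python sketch (a toy illustration of the idea, not the report's eventual model):

```python
from collections import Counter, defaultdict

def next_word_distribution(corpus_words):
    """Estimate P(next | previous) from bigram counts -- the basic
    quantity a next-word predictor needs."""
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus_words, corpus_words[1:]):
        following[prev][nxt] += 1
    return {
        prev: {w: c / sum(cnt.values()) for w, c in cnt.items()}
        for prev, cnt in following.items()
    }
```

On real data, a context-less previous word like “the” spreads this distribution thinly over many nouns (low predictive capability), while a contextual previous word like “mayor” concentrates it on a few continuations (high predictive capability).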

Code

The code for this file can be found on my GitHub: https://github.com/pi-georgia/Capstone-_-Swiftkey.git